Class Imbalance

The harms of class imbalance corrections for machine learning based prediction models: a simulation study

Thomas Reinke

Baylor University

Theophilus A. Bediako

Baylor University

August 10, 2025

Contents

  1. Introduction
  2. Methods
  3. Results
  4. Case Study
  5. Discussion
  6. Conclusion
  7. References

Original Paper

The harms of class imbalance corrections for machine learning based prediction models: a simulation study (Carriero et al. 2024)

Introduction

Introduction

  • Risk prediction models are increasingly vital in healthcare
    • Can help determine an individual’s risk of disease
  • Data used to train these models often suffer from class imbalance, where one class (e.g., patients with a rare disease) is much smaller than the other.
  • A common practice is to apply imbalance corrections (e.g., over- or under-sampling) to artificially balance the dataset
  • However, the effect of these corrections on the calibration of modern machine learning models is not always clear
    • Model calibration captures agreement between the estimated (predicted) and observed number of events
    • A poorly calibrated model over-estimates or under-estimates true risks
    • This can lead to poor treatment decisions
  • This study examines the impact of imbalance corrections on the performance—especially calibration—of several machine learning algorithms.

Methods

Methods

  • Implemented a simulation study to investigate the effects of imbalance correction methods across 18 unique data-generating scenarios
  • Focused on prediction models for dichotomous risk prediction
  • Compared the predictive performance of models developed with imbalance-corrected data to those developed without correction

Data-Generating Scenarios

  • (Table of the 18 data-generating scenarios was not rendered here.)

Data Generating Mechanism

  • Data for the two classes (events and non-events) were generated from distinct multivariate normal distributions.

\[\text{Class 0:} \; \mathbf{X} \sim MVN(\mu_{0}, \Sigma_{0}) = MVN(\mathbf{0}, \Sigma_{0})\] \[ \text{Class 1:} \; \mathbf{X} \sim MVN(\mu_{1}, \Sigma_{1}) = MVN(\Delta_{\mu}, \Sigma_{0} - \Delta_{\Sigma}) \] For 8 predictors, the mean and covariance structure for class 0 was: \[ \mu_0 = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad \Sigma_0 = \begin{bmatrix} 1 & 0.2 & 0.2 & 0.2 & 0.2 & 0.2 & 0 & 0 \\ 0.2 & 1 & 0.2 & 0.2 & 0.2 & 0.2 & 0 & 0 \\ 0.2 & 0.2 & 1 & 0.2 & 0.2 & 0.2 & 0 & 0 \\ 0.2 & 0.2 & 0.2 & 1 & 0.2 & 0.2 & 0 & 0 \\ 0.2 & 0.2 & 0.2 & 0.2 & 1 & 0.2 & 0 & 0 \\ 0.2 & 0.2 & 0.2 & 0.2 & 0.2 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}. \]

Data Generating Mechanism

The mean and covariance structure for class 1 was: \[ \mu_1 = \begin{bmatrix} \delta_\mu \\ \delta_\mu \\ \delta_\mu \\ \delta_\mu \\ \delta_\mu \\ \delta_\mu \\ \delta_\mu \\ \delta_\mu \end{bmatrix}, \quad \Sigma_1 = \begin{bmatrix} 1-\delta_\Sigma & z & z & z & z & z & 0 & 0 \\ z & 1-\delta_\Sigma & z & z & z & z & 0 & 0 \\ z & z & 1-\delta_\Sigma & z & z & z & 0 & 0 \\ z & z & z & 1-\delta_\Sigma & z & z & 0 & 0 \\ z & z & z & z & 1-\delta_\Sigma & z & 0 & 0 \\ z & z & z & z & z & 1-\delta_\Sigma & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1-\delta_\Sigma & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1-\delta_\Sigma \end{bmatrix}. \]

  • \(z = 0.2(1 - \delta_{\Sigma})\), to ensure equivalent correlation matrices between the two classes
  • The parameters \(\delta_{\mu}\) and \(\delta_{\Sigma}\) of each scenario were selected to get a C-statistic of 0.85, providing a stable baseline for comparison
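As a rough illustration, the mechanism above can be sampled with numpy. This is only a sketch: the \(\delta_{\mu}\) and \(\delta_{\Sigma}\) values below are placeholders, not the paper's calibrated per-scenario values.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_sigma0(p=8, rho=0.2, n_corr=6):
    """Sigma_0: first n_corr predictors pairwise correlated at rho, rest independent."""
    s = np.eye(p)
    s[:n_corr, :n_corr] = rho
    np.fill_diagonal(s, 1.0)
    return s

def simulate(n, event_frac, delta_mu, delta_sigma, p=8):
    """Draw one imbalanced dataset from the two-class MVN mechanism."""
    n1 = int(round(n * event_frac))   # events (class 1)
    n0 = n - n1                       # non-events (class 0)
    sigma0 = make_sigma0(p)
    # With z = 0.2(1 - delta_sigma), Sigma_1 equals (1 - delta_sigma) * Sigma_0,
    # so both classes share the same correlation matrix.
    sigma1 = (1 - delta_sigma) * sigma0
    x0 = rng.multivariate_normal(np.zeros(p), sigma0, size=n0)
    x1 = rng.multivariate_normal(np.full(p, delta_mu), sigma1, size=n1)
    return np.vstack([x0, x1]), np.r_[np.zeros(n0), np.ones(n1)]

X, y = simulate(n=1000, event_frac=0.1, delta_mu=0.5, delta_sigma=0.3)
```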

Data Generating Mechanism

\[ C = \Phi \left(\sqrt{\Delta'_\mu ( \Sigma_0 + \Sigma_1)^{-1} \Delta_\mu} \right) \]

  • Concordance is equivalent to the AUC in the dichotomous case
  • Measures model discrimination
  • Captures a model’s ability to yield higher risk estimates for patients with the event than for those without the event
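Plugging illustrative parameters into the closed-form C-statistic is straightforward; note the paper goes the other direction, solving for \(\delta_{\mu}\) and \(\delta_{\Sigma}\) so that C = 0.85, whereas the values below are arbitrary.

```python
import math
import numpy as np

p = 8
delta_mu, delta_sigma = 0.5, 0.3              # illustrative, not the paper's values

sigma0 = np.eye(p)
sigma0[:6, :6] = 0.2
np.fill_diagonal(sigma0, 1.0)
sigma1 = (1 - delta_sigma) * sigma0           # same correlations, shrunken variances
d = np.full(p, delta_mu)                      # Delta_mu: mean shift for class 1

# m = Delta_mu' (Sigma_0 + Sigma_1)^{-1} Delta_mu
m = d @ np.linalg.solve(sigma0 + sigma1, d)
C = 0.5 * (1 + math.erf(math.sqrt(m) / math.sqrt(2)))  # Phi(sqrt(m))
print(round(C, 2))  # ≈ 0.80 for these illustrative parameters
```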

Model Development

  • A two-step procedure for each model:
    • Pre-process the training data with a class imbalance correction method
    • Train a machine learning algorithm on the resulting data
  • Implemented a 5x6 full-factorial design to compare predictive performance
    • 5 strategies (1 control, 4 corrections) crossed with 6 machine learning algorithms
    • 30 unique models were developed and compared in each of the 18 scenarios

Model Development - Imbalance Corrections

  • Five different approaches to handling class imbalance were compared:
    • Control: No correction applied; the model is trained on the original, imbalanced data
    • Random Under Sampling (RUS): Randomly removes samples from the majority class to achieve balance
    • Random Over Sampling (ROS): Randomly duplicates samples from the minority class
    • SMOTE (Synthetic Minority Over-sampling Technique): Creates new, synthetic samples for the minority class by interpolating between existing ones
    • SENN (SMOTE + Edited Nearest Neighbors): A hybrid method that first applies SMOTE and then removes observations that are likely noise
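The two resampling baselines can be sketched directly in numpy (SMOTE and SENN require nearest-neighbour machinery and are omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_under_sample(y):
    """RUS: keep every minority case plus an equal-size random subset of the majority."""
    minority, majority = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    keep = rng.choice(majority, size=minority.size, replace=False)
    return np.sort(np.r_[minority, keep])

def random_over_sample(y):
    """ROS: keep the majority and re-draw minority cases with replacement to match its size."""
    minority, majority = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    dup = rng.choice(minority, size=majority.size, replace=True)
    return np.r_[majority, dup]

y = np.r_[np.zeros(90), np.ones(10)]            # event fraction 0.1
assert y[random_under_sample(y)].mean() == 0.5  # balanced, n = 20
assert y[random_over_sample(y)].mean() == 0.5   # balanced, n = 180
```

In practice these two, along with SMOTE and SMOTE+ENN, are available in the imbalanced-learn package.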

Model Development - Machine Learning Algorithms

  • Six machine learning algorithms, frequently used in clinical prediction, were evaluated:
    • Logistic Regression (LR)
    • Support Vector Machine (SVM)
    • Random Forest (RF)
    • XGBoost (XG)
    • Two ensemble algorithms specifically designed to handle imbalance were included:
      • RUSBoost (RB): A boosting algorithm that incorporates random undersampling in each iteration
      • EasyEnsemble (EE): A bagging-based algorithm that uses undersampling

Simulation Methods

  • For each of the 18 scenarios, 2000 independent datasets were generated
  • Each dataset was composed of a training set and a validation set that was 10 times larger to ensure stable performance evaluation
  • Models were trained on the training data and their performance was assessed on the unseen validation data

Simulation Methods

  • A logistic re-calibration step was also performed on all model predictions to see if post-hoc adjustments could fix any initial miscalibration

\[ \log \left(\frac{P(Y_i=1)}{1 - P(Y_i=1)}\right) = \beta_0 + \log\left(\frac{p_i}{1 - p_i}\right) \]

  • Predictive performance was then re-assessed using the re-calibrated predictions
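A minimal sketch of this intercept-only logistic re-calibration, fit by Newton's method; the data below are simulated over-estimated risks, and all names and values are illustrative.

```python
import numpy as np

def recalibrate_intercept(p, y, iters=50):
    """MLE of beta0 in logit(P(Y=1)) = beta0 + logit(p), via Newton's method."""
    lp = np.log(p / (1 - p))           # offset: logit of the original predictions
    b0 = 0.0
    for _ in range(iters):
        q = 1 / (1 + np.exp(-(b0 + lp)))
        grad = np.sum(y - q)           # score of the Bernoulli log-likelihood
        hess = -np.sum(q * (1 - q))
        b0 -= grad / hess
    return b0

rng = np.random.default_rng(1)
true_p = rng.uniform(0.05, 0.4, size=5000)
y = rng.binomial(1, true_p)
over_p = np.clip(true_p * 1.8, 0.01, 0.99)   # systematically over-estimated risks
b0 = recalibrate_intercept(over_p, y)
# b0 < 0: the intercept shifts the average predicted risk back down,
# but it cannot change the calibration slope, i.e. the shape of the miscalibration.
```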

Performance Measures

  • Model performance was evaluated with three metrics:
    • Calibration: Assessed visually with flexible calibration curves and quantitatively with the calibration intercept (ideal=0) and calibration slope (ideal=1)
    • Discrimination: The model’s ability to separate events from non-events, measured by Concordance (ideal=1)
    • Overall Performance: A score reflecting calibration and discrimination, measured by Brier score (ideal=0)

\[ BS = \frac{1}{N} \sum\limits_{t=1}^N (f_t - o_t)^2 \]
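The Brier score is simple to compute; a minimal sketch of the formula above:

```python
import numpy as np

def brier_score(p, y):
    """Mean squared difference between predicted probabilities and observed outcomes."""
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
    return np.mean((p - y) ** 2)

# A perfectly confident, correct model scores 0;
# always predicting 0.5 scores 0.25 regardless of the outcomes.
assert brier_score([1.0, 0.0], [1, 0]) == 0.0
assert brier_score([0.5, 0.5], [1, 0]) == 0.25
```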

Software & Error Handling

  • All simulations were conducted in R on a high-performance computing cluster
  • A clear error-handling protocol was established: if an imbalance correction or ML algorithm failed, the process would continue where possible (e.g., by using uncorrected data) and the failure would be logged
  • As a consequence of this case-by-case handling, the results are not fully reproducible

Results

Results

  • Across all scenarios with class imbalance, models developed without imbalance correction consistently demonstrated equal or superior calibration compared to models with corrections applied
  • Calibration:
    • Applying any imbalance correction, whether through pre-processing (RUS, ROS, SMOTE, SENN) or specialized algorithms (RB, EE), systematically introduced miscalibration
    • This miscalibration was consistently characterized by an over-estimation of risk
  • Discrimination:
    • The impact on discrimination was inconsistent and highly dependent on the algorithm
    • Any observed benefits were generally small
  • Overall Performance:
    • The control models consistently had the best Brier scores
  • Re-calibration:
    • Post-hoc re-calibration adjusted the average predicted risk
    • It could not fix the underlying miscalibration introduced by the imbalance corrections

Results

Table 4a: Median performance for simulation scenarios 4-6.
Columns: five correction strategies (Control, RUS, ROS, SMOTE, SENN), each with six algorithm columns in the order LR, SVM, RF, XG, RB, EE (30 values per row).
Scenario 4
Concordance 0.84 0.86 0.84 0.84 0.84 0.83 0.84 0.86 0.84 0.84 0.83 0.82 0.84 0.85 0.84 0.82 0.83 0.82 0.84 0.86 0.84 0.84 0.83 0.82 0.84 0.85 0.84 0.84 0.83 0.83
MCMC Error 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
Brier Score 0.16 0.15 0.16 0.16 0.17 0.19 0.16 0.15 0.16 0.16 0.17 0.19 0.16 0.16 0.17 0.19 0.17 0.19 0.16 0.15 0.16 0.16 0.17 0.19 0.17 0.17 0.17 0.18 0.17 0.17
MCMC Error <0.01 <0.01 <0.01 0.01 <0.01 <0.01 <0.01 <0.01 <0.01 0.01 0.01 <0.01 <0.01 0.01 0.01 0.01 0.01 0.01 <0.01 <0.01 <0.01 0.01 0.01 <0.01 0.01 0.01 0.01 0.01 0.01 0.01
Calib. Int. <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01
MCMC Error 0.14 0.14 0.12 0.15 0.07 0.05 0.11 0.12 0.09 0.11 0.07 0.05 0.13 0.20 0.34 0.51 0.18 0.17 0.10 0.11 0.09 0.11 0.08 0.05 0.23 0.25 0.15 0.26 0.14 0.07
Calib. Slope 0.92 0.96 1.25 0.91 1.46 >10.0 0.92 0.96 1.25 0.90 1.46 >10.0 0.88 0.87 1.21 0.56 1.26 >10.0 0.90 0.94 1.21 0.89 1.43 >10.0 0.57 0.58 0.78 0.52 0.83 1.62
MCMC Error 0.10 0.11 0.18 0.12 0.17 0.17 0.11 0.11 0.18 0.12 0.17 0.17 0.12 0.11 0.19 0.10 0.15 0.16 0.10 0.11 0.18 0.12 0.16 0.16 0.19 0.20 0.27 0.21 0.33 0.34
Scenario 5
Concordance 0.84 0.82 0.82 0.82 0.83 0.82 0.83 0.85 0.83 0.80 0.81 0.80 0.83 0.83 0.83 0.79 0.80 0.80 0.84 0.83 0.82 0.81 0.80 0.81 0.83 0.83 0.82 0.82 0.81 0.82
MCMC Error 0.01 0.03 0.02 0.02 0.02 0.02 0.02 0.01 0.02 0.02 0.02 0.02 0.01 0.02 0.02 0.02 0.02 0.02 0.01 0.02 0.02 0.02 0.02 0.02 0.01 0.02 0.02 0.02 0.02 0.02
Brier Score 0.12 0.12 0.12 0.12 0.17 0.20 0.18 0.16 0.18 0.19 0.20 0.20 0.17 0.15 0.12 0.15 0.15 0.15 0.16 0.14 0.13 0.15 0.14 0.15 0.19 0.16 0.15 0.17 0.16 0.15
MCMC Error 0.01 0.01 0.01 0.01 0.01 0.01 0.02 0.02 0.02 0.02 0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.02 0.01 0.01 0.01 0.01 0.01
Calib. Int. <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01
MCMC Error 0.22 0.23 0.18 0.23 0.10 0.07 0.40 0.23 0.18 0.36 0.14 0.08 0.16 0.24 0.17 0.43 0.17 0.10 0.18 0.30 0.18 0.33 0.20 0.10 0.30 0.40 0.24 0.46 0.25 0.10
Calib. Slope 0.85 0.97 1.27 0.79 1.52 >10.0 0.67 0.96 1.26 0.62 1.30 >10.0 0.76 0.61 1.10 0.33 0.85 1.78 0.71 0.55 0.84 0.41 0.78 1.71 0.54 0.43 0.66 0.35 0.58 1.48
MCMC Error 0.14 0.55 0.20 0.13 0.24 0.27 0.17 0.26 0.31 0.14 0.24 0.27 0.14 0.11 0.20 0.05 0.13 0.18 0.13 0.10 0.15 0.07 0.13 0.17 0.12 0.08 0.15 0.05 0.10 0.16
Scenario 6
Concordance 0.84 0.71 0.78 0.82 0.83 0.81 0.82 0.84 0.82 0.79 0.81 0.79 0.84 0.80 0.81 0.76 0.76 0.76 0.84 0.80 0.80 0.78 0.76 0.78 0.84 0.80 0.80 0.79 0.77 0.79
MCMC Error 0.01 0.04 0.02 0.02 0.02 0.02 0.03 0.02 0.02 0.03 0.03 0.03 0.01 0.02 0.02 0.02 0.03 0.02 0.01 0.02 0.02 0.02 0.02 0.02 0.01 0.02 0.02 0.02 0.02 0.02
Brier Score 0.02 0.02 0.02 0.02 0.16 0.20 0.19 0.17 0.18 0.21 0.22 0.20 0.16 0.05 0.02 0.02 0.04 0.04 0.15 0.05 0.03 0.03 0.03 0.05 0.16 0.05 0.03 0.04 0.04 0.05
MCMC Error <0.01 <0.01 <0.01 <0.01 0.02 0.01 0.04 0.03 0.03 0.04 0.03 0.02 0.01 0.01 <0.01 <0.01 <0.01 <0.01 0.02 0.01 <0.01 <0.01 <0.01 0.01 0.02 0.01 <0.01 <0.01 <0.01 0.01
Calib. Int. <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01
MCMC Error <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 >10.0 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01
Calib. Slope 0.90 1.41 0.63 0.84 1.49 >10.0 0.54 0.96 1.18 0.52 1.26 1.86 0.72 0.14 0.53 0.25 0.35 1.67 0.62 0.16 0.48 0.22 0.28 1.50 0.57 0.15 0.45 0.20 0.25 1.38
MCMC Error 0.12 >10.0 0.20 0.13 0.28 0.30 0.18 0.66 0.32 0.12 0.26 0.28 0.13 0.04 0.18 0.02 0.06 0.17 0.11 0.03 0.12 0.02 0.05 0.13 0.11 0.03 0.11 0.02 0.04 0.12

Results

Table 4b: Median performance for recalibrated scenarios 4-6.
Columns: five correction strategies (Control, RUS, ROS, SMOTE, SENN), each with six algorithm columns in the order LR, SVM, RF, XG, RB, EE (30 values per row).
Scenario 4 Recalibrated
Concordance 0.84 0.84 0.83 0.82 0.84 0.85 0.84 0.82 0.83 0.82 0.84 0.86 0.86 0.84 0.84 0.83 0.82 0.84 0.85 0.84 0.84 0.83 0.84 0.86 0.84 0.83 0.84 0.84 0.83 0.84
MCMC Error 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
Brier Score 0.16 0.16 0.17 0.19 0.16 0.16 0.17 0.18 0.17 0.19 0.16 0.15 0.15 0.16 0.16 0.17 0.19 0.17 0.16 0.17 0.17 0.17 0.16 0.17 0.16 0.17 0.19 0.16 0.15 0.16
MCMC Error <0.01 0.01 0.01 <0.01 <0.01 0.01 <0.01 0.01 0.01 0.01 <0.01 <0.01 <0.01 <0.01 0.01 0.01 <0.01 0.01 0.01 0.01 0.01 0.01 <0.01 0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01
Calib. Int. <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01
MCMC Error <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01
Calib. Slope 0.92 0.90 1.46 >10.0 0.88 0.87 1.21 0.56 1.26 >10.0 0.90 0.96 0.94 1.21 0.89 1.43 >10.0 0.57 0.58 0.78 0.52 0.83 1.62 0.92 0.96 1.25 0.91 1.46 >10.0 1.25
MCMC Error 0.10 0.12 0.17 0.17 0.12 0.11 0.19 0.10 0.15 0.16 0.10 0.11 0.11 0.18 0.12 0.16 0.16 0.19 0.20 0.27 0.21 0.33 0.34 0.11 0.11 0.18 0.12 0.17 0.17 0.18
Scenario 5 Recalibrated
Concordance 0.84 0.80 0.81 0.80 0.83 0.83 0.83 0.79 0.80 0.80 0.84 0.82 0.83 0.82 0.81 0.80 0.81 0.83 0.83 0.82 0.82 0.81 0.82 0.83 0.82 0.83 0.85 0.83 0.82 0.82
MCMC Error 0.01 0.02 0.02 0.02 0.01 0.02 0.02 0.02 0.02 0.02 0.01 0.03 0.02 0.02 0.02 0.02 0.02 0.01 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.01 0.02 0.02
Brier Score 0.12 0.13 0.13 0.14 0.12 0.13 0.12 0.15 0.13 0.13 0.12 0.12 0.13 0.12 0.14 0.13 0.13 0.13 0.14 0.13 0.15 0.13 0.12 0.13 0.12 0.12 0.13 0.13 0.12 0.12
MCMC Error 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 <0.01 0.01 0.01 0.01 0.01 0.01 0.01 <0.01 0.01 0.01 0.01 0.01 0.01 <0.01 0.01 0.01 0.01 <0.01 0.01 0.01 0.01
Calib. Int. <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01
MCMC Error <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01
Calib. Slope 0.85 0.62 1.30 >10.0 0.76 0.61 1.10 0.33 0.85 1.78 0.71 0.97 0.55 0.84 0.41 0.78 1.71 0.54 0.43 0.66 0.35 0.58 1.48 0.67 0.96 1.26 0.79 1.52 >10.0 1.27
MCMC Error 0.14 0.14 0.24 0.27 0.14 0.11 0.20 0.05 0.13 0.18 0.13 0.55 0.10 0.15 0.07 0.13 0.17 0.12 0.08 0.15 0.05 0.10 0.16 0.17 0.26 0.31 0.13 0.24 0.27 0.20
Scenario 6 Recalibrated
Concordance 0.84 0.79 0.81 0.79 0.84 0.80 0.80 0.76 0.76 0.76 0.84 0.71 0.80 0.80 0.78 0.76 0.78 0.84 0.80 0.80 0.79 0.77 0.78 0.82 0.83 0.81 0.82 0.84 0.82 0.79
MCMC Error 0.01 0.03 0.03 0.03 0.01 0.03 0.10 0.02 0.03 0.02 0.01 0.04 0.02 0.04 0.02 0.02 0.02 0.01 0.02 0.05 0.02 0.02 0.03 0.02 0.02 0.02 0.02 0.02 0.02 0.02
Brier Score 0.02 0.02 0.02 0.02 0.02 0.03 0.02 0.03 0.02 0.02 0.02 0.02 0.03 0.02 0.03 0.02 0.02 0.02 0.03 0.02 0.03 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
MCMC Error <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01
Calib. Int. <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01
MCMC Error <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 >10.0 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01 <0.01
Calib. Slope 0.90 0.52 1.26 1.86 0.72 0.14 0.56 0.25 0.35 1.67 0.62 1.41 0.16 0.48 0.22 0.28 1.50 0.57 0.15 0.45 0.20 0.25 1.38 0.84 1.49 >10.0 0.54 0.96 1.18 0.63
MCMC Error 0.12 0.12 0.26 0.28 0.13 0.05 0.16 0.02 0.06 0.17 0.11 >10.0 0.03 0.12 0.02 0.05 0.13 0.11 0.03 0.11 0.02 0.04 0.12 0.13 0.28 0.30 0.18 0.66 0.32 0.20

MIMIC-III Data Case Study

MIMIC-III Data Case Study

  • Goal: To test whether the simulation findings hold on a real-world dataset
  • Data: The MIMIC-III database was used to develop models predicting 90-day mortality for ICU patients
    • The dataset had an event fraction of 0.17
  • Methods: Same 30 model-building pipelines from simulation applied to MIMIC-III data
  • Findings:
    • The case study results strongly corroborated the simulation findings
    • Every model that used an imbalance correction exhibited significant miscalibration, systematically overestimating the risk of mortality
    • These models also had worse overall performance (Brier score) compared to their uncorrected counterparts

Discussion

Discussion

  • This study provides strong evidence that for developing calibrated clinical prediction models, applying common imbalance corrections is often harmful
  • The primary harm is a systematic overestimation of risk, which can lead to poor clinical decisions
    • This miscalibration is not easily fixed by post-hoc methods
  • The potential gains in discrimination from corrections do not outweigh the significant cost to calibration
  • Standard ML algorithms are often surprisingly robust and produce well-calibrated models when trained directly on imbalanced data
  • Limitations: The study was confined to low-dimensional settings
    • Further research could explore higher-dimensional settings

Conclusion

Conclusion

  • Correcting for class imbalance is a widely used technique, but it can seriously harm model calibration
  • When the goal is to produce reliable and accurate risk estimates for individual patients, applying imbalance corrections may do more harm than good
  • Researchers and practitioners should be cautious, prioritizing model calibration over correcting class imbalance

References

References

Carriero, Alex, Kim Luijken, Anne de Hond, Karel GM Moons, Ben van Calster, and Maarten van Smeden. 2024. “The Harms of Class Imbalance Corrections for Machine Learning Based Prediction Models: A Simulation Study.” https://arxiv.org/abs/2404.19494.
Goorbergh, Ruben van den, Maarten van Smeden, Dirk Timmerman, and Ben Van Calster. 2022. “The Harm of Class Imbalance Corrections for Risk Prediction Models: Illustration and Simulation Using Logistic Regression.” https://arxiv.org/abs/2202.09101.